Predicting Remote Reuse Distance Patterns in Unified Parallel C Applications
Abstract
Productivity is becoming increasingly important in high performance computing. Parallel systems, as well as the problems they are being used to solve, are becoming dramatically larger and more complicated. Traditional approaches to programming for these systems, such as MPI, are regarded as too tedious and too tied to particular machines. Languages such as Unified Parallel C (UPC) attempt to simplify programming on these systems by abstracting the communication with a global shared memory, partitioned across all the threads in an application. These Partitioned Global Address Space (PGAS) languages offer the programmer a way to specify programs in a much simpler and more portable fashion. However, the performance of PGAS applications has tended to lag behind that of applications implemented in a more traditional way. It is hoped that cache optimizations can close this performance gap by giving UPC applications benefits similar to those they have given single-threaded applications. Memory reuse distance is a critical measure of how much an application will benefit from a cache, as well as an important piece of tuning information for enabling effective cache optimization. This research explores extending existing reuse distance analysis to remote memory accesses in UPC applications. Existing analyses efficiently store a very good approximation of the reuse distance histogram for each memory access in a program. Reuse data are collected for small test runs and then used to predict program behavior during full runs by curve fitting the patterns seen in the training runs to a function of the problem size. Reuse data are kept for each UPC thread in a UPC application, and these data are used to predict the data for each UPC thread in a larger run. Both scaling up the problem size and increasing the total number of UPC threads are explored for prediction. Results indicate that good predictions can be made using existing prediction algorithms.
However, the choice of training threads can have a dramatic effect on the accuracy of the prediction. Therefore, a simple algorithm is also presented that partitions threads into groups with similar behavior, so that the threads selected in the training runs lead to good predictions in the full run.
*This work is partially supported by NSF grant CCF-0833082.
CHAPTER 1 Introduction
1.1 Motivation
High performance computing is becoming an increasingly important part of our daily lives. It is used to determine where oil companies drill for oil, to figure out what the weather will be like for the next week, and to design safer buildings and vehicles. Companies save millions of dollars every year by simulating product designs instead of creating physical prototypes. The movie industry relies heavily on special effects rendered on large clusters. Scientists rely on simulations to understand nuclear reactions without having to perform dangerous experiments with nuclear materials.
In addition, the machines used to carry out these computations are becoming drastically larger and more complicated. The Top500 list claims that the fastest supercomputer in the world has over 200,000 cores [1]. It is made up of thousands of six-core Opteron processors in compute blades networked together. The compute blades can each be considered a computer in its own right, working together with the others to act as one large supercomputer. This clustering model of supercomputer is now the dominant force in high performance computing.
Traditionally, applications written for these clusters required the programmer to explicitly manage the communication needs of the program across the various nodes of the cluster. It was thought that the performance needs of such applications could only be met by a human programmer carefully designing the program to minimize the necessary communication costs. This approach to programming for supercomputers is quickly becoming unwieldy.
The productivity cost of requiring the application programmer to manage and tune an application's communication for these complex systems is simply too high. Partitioned global address space languages, such as Co-Array Fortran [2] and Unified Parallel C [3], attempt to address these productivity concerns by building a shared memory programming model for programmers to work with, delegating the task of optimizing the necessary communication to the language implementor.
While these languages do offer productivity improvements, implementations haven't been able to match the performance of more traditional message passing setups. Various approaches to catching up have been tried. The UPC implementation from the University of California, Berkeley [4] uses the GASNet network library [5], which attempts to optimize communication using various methods such as message coalescing [6]. Many implementations try to split a synchronous operation into an asynchronous operation and a corresponding wait, then spread these as far apart as possible to hide communication latency. These optimizations can lead to impressive performance gains, but there is still a performance gap for some applications.
One approach that hasn't been commonly used in implementations is software caching of remote memory operations. Caching has been used to great effect in many situations to hide the cost of expensive operations. Programmers are also accustomed to working with caches, since they are so prevalent in today's CPUs. As Marc Snir pointed out in his keynote address to the PGAS2009 conference [7], programmers would like to see some kind of caching in these languages' implementations. He demoed a caching scheme implemented entirely in the application. However, it is desirable that the caching be done at the level of the language implementation, to avoid forcing the programmer to deal with the complexities of communication that these languages were designed to hide.
This research takes an initial look at whether existing algorithms, designed to predict patterns in the reuse distances of memory operations in single-threaded applications, can be used to predict patterns in the reuse distances of remote memory operations in Unified Parallel C applications. It is hoped that this information could be used to tune cache behavior for caching remote references, and to enable other optimizations that rely on this information and have been successfully used with single-threaded applications to work with multi-threaded UPC applications.
1.2 Thesis Outline
The rest of this document is organized as follows. Chapter 2 gives a broad background in instruction-based reuse distance analysis and Unified Parallel C. Chapter 3 introduces the instrumentation, test kernels, and models used to predict remote memory behavior. Chapter 4 shows the prediction results obtained. Finally, Chapter 5 summarizes the results and looks at ways this prediction model can be used, as well as possible future work to overcome some of this model's weaknesses.
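To make the central quantity concrete, the following is a minimal sketch of how reuse distances and their histogram might be measured for one thread's access trace. It assumes the common definition of reuse distance (the number of distinct locations touched between two consecutive accesses to the same location), which this excerpt does not spell out. The function names and the toy trace are hypothetical, and the simple list-based LRU stack stands in for the efficient approximate structures that the existing analyses use.

```python
from collections import Counter


def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    Assumed definition: the number of distinct addresses touched between
    two consecutive accesses to the same address. A first access has no
    previous use, so it is reported as None.
    """
    stack = []  # addresses ordered from least to most recently used
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Everything above `addr` in the stack was touched since its last use.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)  # `addr` is now the most recently used address
    return distances


def reuse_histogram(trace):
    """Count how often each finite reuse distance occurs in `trace`."""
    return Counter(d for d in reuse_distances(trace) if d is not None)


# Hypothetical trace of remote addresses touched by a single thread.
trace = ["a", "b", "c", "a", "b", "b"]
distances = reuse_distances(trace)
histogram = reuse_histogram(trace)
print(distances)
print(dict(histogram))
```

Under this definition, an access whose reuse distance is smaller than the capacity of a fully associative LRU cache would hit in that cache, which is why the histogram is useful as tuning information. In a UPC setting, one such trace would presumably be kept per thread and restricted to remote accesses.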
Similar resources
Path-Based Reuse Distance Analysis
Profiling can effectively analyze program behavior and provide critical information for feedback-directed or dynamic optimizations. Based on memory profiling, reuse distance analysis has shown much promise in predicting data locality for a program using inputs other than the profiled ones. Both whole-program and instruction-based locality can be accurately predicted by reuse distance analysis. R...
Evaluation of Distance and Quadratic Indices for Determination of Plant Species Distribution Pattern in Khoosef Rangelands, Birjand, Iran
One of the major issues examined in quantitative ecology is the spatial distribution pattern of plant species. Knowledge of the spatial distribution patterns is essential to measure the level of uniformity in the surrounding environment, plant reproduction, distribution of the seedlings, plant behavioral patterns, coexistence, allelopathic relations, and competition. Therefore, the aim ...
Memory Management Techniques for Exploiting RDMA in PGAS Languages
Partitioned Global Address Space (PGAS) languages are a popular alternative when building applications to run on large scale parallel machines. Unified Parallel C (UPC) is a well known PGAS language that is available on most high performance computing systems. Good performance of UPC applications is often one important requirement for a system acquisition. This paper presents the memory managem...
Project Report for CS573-Optimization Compilers Resuse Distance Analysis of the Mediabench Suite
This is the project report for the Course CS573 Optimization Compilers. The goal of the project is to study the reuse distance of several of the applications included in the MediaBench Suite. 1 Methodology The reuse distance of a memory location or variable of a program is the number of individual memory references between two consecutive references to the same memory location or variable. The ...
Fixed point results in cone metric spaces endowed with a graph
In this paper, we prove the existence of fixed points for Chatterjea type mappings under $c$-distance in cone metric spaces endowed with a graph. The main results extend, generalize, and unify some fixed point theorems on $c$-distance in metric and cone metric spaces.